Tiffany Chan
Unsupervised Learning Assignment Credit Card Segmentation Project
#Import the basic fundamental libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Bring in the credit card customer data
ccdata = pd.read_excel('Credit Card Customer Data.xlsx')
Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea about the no of clusters. Perform EDA, create visualizations to explore data. (10 marks)
Properly comment on the codes, provide explanations of the steps taken in the notebook and conclude your insights from the graphs. (5 marks)
#Evaluating the first 5 customer results from the data
ccdata.head()
#Let's look at the shape of the data and see if there are any missing values.
#If there are missing values, we can choose to impute them
print("Shape of dataset")
print(ccdata.shape) #Shape of the data
print("")
print("Number of missing values in each variable") #Checking for missing values
print(ccdata.isnull().sum())
print("")
print("Table of Possible Duplicates in the data")
print(ccdata[ccdata.duplicated(keep='first')]) #Recording which cases are duplicates
print("")
In the original dataset there are 660 records and 7 columns in total. There are no missing values and no duplicate rows, so we do not have to impute any missing data. However, we may still need to treat extreme values (outliers).
#Looking at descriptive statistics.
ccdata.describe()
Looking at the descriptive statistics of this dataset, we can see that the data needs to be standardized so that certain columns do not receive more weight than others simply because of their scale: the ranges differ, and Avg_Credit_Limit has a much larger range than all the other variables. Standardization puts the variables on a comparable footing. Of course, we should ignore the SI_No and Customer_Key columns because they are identifier variables and carry no quantitative meaning.
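As a quick check on that claim, here is a small sketch (not part of the original notebook) that summarizes the scale of each numeric feature, using the same positional slicing applied later in this notebook to skip the two identifier columns:
#Summary of the scale of each feature (identifier columns excluded by position)
ccdata.iloc[:, 2:].agg(['min', 'max', 'mean', 'std'])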
UNIVARIATE ANALYSIS
#Univariate analysis for Avg_Credit_Limit
#Looking at the histogram/distplot and the frequencies of average credit limit in the dataset
sns.distplot(ccdata['Avg_Credit_Limit']) #Histogram/distplot
plt.show()
print(ccdata['Avg_Credit_Limit'].value_counts()) #Frequencies
#Boxplot for Avg_Credit_Limit
sns.boxplot(ccdata['Avg_Credit_Limit']) #Boxplot
plt.show()
This histogram/distplot shows that the variable is right-skewed and likely has many outliers beyond the upper whisker in the traditional boxplot sense. Because we are dealing with K-means and several hierarchical linkage methods that can be sensitive to outliers, we should consider replacing those values with the mean, median, or whisker value. Let's explore further whether to proceed with this decision by conducting more EDA:
#We need to see if it would be worth it to impute the outliers in this feature.
#First, we need the following measurements.
q75, q25 = np.percentile(ccdata['Avg_Credit_Limit'], [75 ,25])
iqr = q75 - q25
print(q75) #Find 75th percentile
q75 + 1.5*iqr #Find value at upper whisker in order to set limits to find the outliers.
#Select the customers whose Avg_Credit_Limit exceeds the upper whisker value (105000.0)
Outliers = ccdata[(ccdata['Avg_Credit_Limit'] > 105000.0)]
len(Outliers) #Count the number of outliers and then determine if it's worth imputing
39 out of 660 customers is about 5.9%, which is a small proportion, so we can impute these values with the upper whisker value. The reason for doing this is that K-means is sensitive to outliers, and average linkage is also known to be sensitive to them; both methods calculate means, which are heavily influenced by outliers.
#Replace the outliers with the upper whisker value
ccdata.loc[ccdata['Avg_Credit_Limit'] > 105000.0, 'Avg_Credit_Limit'] = 105000.0 #Use .loc to avoid chained-assignment issues
ccdata[['Avg_Credit_Limit']].boxplot()
No more outliers in this feature.
#Univariate analysis for Total_Credit_Cards
#Looking at the histogram/distplot and the frequencies of Total_Credit_Cards in the dataset
print(ccdata['Total_Credit_Cards'].value_counts()) #Frequencies
sns.distplot(ccdata['Total_Credit_Cards']) #Distplot/histogram
plt.show()
#Boxplot for Total_Credit_Cards
sns.boxplot(ccdata['Total_Credit_Cards']) #Boxplot
plt.show()
It is clear from the boxplot and the distplot/histogram that there are no outliers beyond the upper and lower whiskers for the total number of credit cards. There could be 4 clusters according to the KDE shown here, but we are only observing this in one dimension. We will postpone further discussion of the potential number of clusters until we look at the scaled pairplot later in this assignment.
#Univariate analysis for Total_visits_bank
#Looking at the histogram/distplot and the frequencies of Total_visits_bank in the dataset
sns.distplot(ccdata['Total_visits_bank'])
ccdata['Total_visits_bank'].value_counts()
#Boxplot for Total_visits_bank
sns.boxplot(ccdata['Total_visits_bank']) #Boxplot
It is clear from the boxplot and the distplot/histogram that there are no outliers beyond the upper and lower whiskers for total bank visits.
#Univariate analysis for Total_calls_made
#Looking at the histogram/distplot and the frequencies of Total_calls_made in the dataset
sns.distplot(ccdata['Total_calls_made'])
ccdata['Total_calls_made'].value_counts()
#Boxplot for Total_calls_made
sns.boxplot(ccdata['Total_calls_made'])
There are no traditional outliers for total calls made in this dataset.
BIVARIATE ANALYSIS AND SCALING
# K Means and machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
#Import the following for scaling.
from scipy.stats import zscore
#Pairplot for bivariate analysis.
#Use diagonal univariate analysis to count the possible number of clusters for K means.
ccdata1=ccdata.iloc[:,2:] #Exclude the ID variables: Customer Key and SI_No
ccdatascaled=ccdata1.apply(zscore) #For scaling. Make the mean = 0, and sd = 1
sns.pairplot(ccdatascaled,diag_kind='kde') #Pairplot code, set the diagonal with kde.
To determine the minimum number of clusters, we focus on the diagonal, which is essentially a scaled version of the histograms/distplots with superimposed KDEs plotted above. Judging from Total_Credit_Cards, the minimum number of clusters could be 4, but this is only a possibility. There could well be more clusters due to the presence of the other variables/dimensions; we are only observing 1-2 dimensions at a time, and if we incorporated the other variables there may be clusters that are not immediately obvious in a 2-dimensional view.
#Finding optimal no. of clusters using the Elbow plot
from scipy.spatial.distance import cdist #Used to calculate distances between points
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k) # K Means with k clusters
    model.fit(ccdatascaled) #Fit the model on scaled data
    prediction = model.predict(ccdatascaled) #Predict cluster assignments on the scaled data
    meanDistortions.append(sum(np.min(cdist(ccdatascaled, model.cluster_centers_, 'euclidean'), axis=1)) / ccdatascaled.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
The elbow plot is subjective, but it is a useful guide for determining a good number of clusters for K-means clustering. It is advised to choose the k value at the "elbow" of the plot, the point after which the within-group sum of squares (distortion) changes very little and the curve flattens out. In this case, at k = 3 the line begins to level off, so most of the variation is already captured by 3 clusters; adding a fourth cluster would not add much to explaining the variation in the data. We should still explore k = 4 and examine the respective silhouette scores.
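To complement the elbow plot, here is a small sketch (not part of the original notebook; the random_state value is an assumption added only for reproducibility) that computes the silhouette score for several candidate values of k on the scaled data:
#Silhouette scores for a range of candidate k values on the scaled data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42) #random_state chosen arbitrarily for reproducibility
    labels = km.fit_predict(ccdatascaled) #Fit and assign cluster labels
    print('k = %d, silhouette = %.3f' % (k, silhouette_score(ccdatascaled, labels)))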
Execute hierarchical clustering (with different linkages) with the help of dendrogram and cophenetic coeff. Analyse clusters formed using boxplot (15 marks)
Calculate average silhouette score for both methods. (5 marks)
PERFORMING K MEANS, EVALUATING K = 3
# Let us first start with K = 3
final_model=KMeans(3)
final_model.fit(ccdatascaled)
prediction=final_model.predict(ccdatascaled)
#Append the prediction as a new variable called "GROUP" (the cluster assignment) to both the regular data and the scaled data.
ccdata1["GROUP"] = prediction
ccdatascaled["GROUP"] = prediction
print("Groups Assigned : \n")
ccdata1.head()
The 3 group/cluster numbers are: 0, 1, and 2.
#Calculating the means of every variable in the unscaled data and organizing them by cluster using groupby
ccdataclust = ccdata1.groupby(['GROUP'])
ccdataclust.mean()
Customers with a higher average credit limit also hold more credit cards on average and tend to bank online more frequently. Those in the middle tier of average credit limit are more likely to visit or call the bank than to use online banking.
#Observe the boxplots for the scaled data and visualize if k means clustering is a good segmentation method for this data.
ccdatascaled.boxplot(by='GROUP', layout = (2,4),figsize=(15,10))
Additional statistical check for cluster separation: visually, these boxplots show considerable overlap when the data is segmented into 3 clusters (k = 3). One way to check statistically whether the 3 clusters differ on these 5 variables is a test such as a one-way ANOVA per variable. The downside of ANOVA is that it only tells us whether at least one cluster differs from the others; it does not tell us which clusters differ.
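As a rough sketch of that idea (an addition, not part of the original notebook), here is a one-way ANOVA for each feature across the three K-means groups, assuming ccdata1 still carries the GROUP column appended above:
#One-way ANOVA per feature across the K-means clusters
from scipy.stats import f_oneway
for col in ccdata1.columns.drop('GROUP'):
    samples = [grp[col].values for _, grp in ccdata1.groupby('GROUP')] #Feature values split by cluster
    f_stat, p_val = f_oneway(*samples)
    print('%s: F = %.2f, p = %.4g' % (col, f_stat, p_val))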
Another observation is that some clusters show outliers for particular variables, such as Groups 1 and 2 for Avg_Credit_Limit and Group 1 for Total_visits_online. These points were not outliers in the full dataset after we treated Avg_Credit_Limit; they appear here because each boxplot's whiskers are computed within a single cluster, so a value can be extreme relative to its own cluster even if it is not extreme overall. The cluster assignments themselves also depend on where the K-means centroids end up, which can change from run to run.
The silhouette score is another metric that is effective for evaluating K-means clustering.
#Calculate silhouette score
from sklearn.metrics import silhouette_score
k_means_3_score = silhouette_score(ccdatascaled, final_model.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % k_means_3_score)
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_K_Means_3 ={'Metric':['Silhouette Score'], 'K Means(3 Clusters)':[k_means_3_score]}
dataframe1 = pd.DataFrame(silhouette_score_K_Means_3)
dataframe1
The silhouette score measures how well separated the clusters are. It is the average, over all observations in the multi-dimensional space, of how similar each point is to its own cluster compared with the nearest neighboring cluster.
The silhouette score ranges from -1 to 1. A value near 1 means that, on average, points are closer to their own cluster than to the neighboring cluster(s); a value near -1 means that, on average, points are closer to a neighboring cluster than to their assigned cluster.
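Concretely, for a single observation i the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the score reported below is the average of s(i) over all observations.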
For k = 3, the silhouette score is 0.53, which is decent; values above roughly 0.5 are generally considered reasonable.
EVALUATING K = 4
# Let's now try K = 4
final_model=KMeans(4)
final_model.fit(ccdatascaled)
prediction=final_model.predict(ccdatascaled)
#Append the prediction to the original data and the scaled data.
ccdata1["GROUP"] = prediction
ccdatascaled["GROUP"] = prediction
print("Groups Assigned : \n")
ccdata1.head() #Making sure the new GROUP variable is appended.
#Calculating the means of every variable in the unscaled data and organizing it by cluster using groupby
ccdataclust4 = ccdata1.groupby(['GROUP'])
ccdataclust4.mean()
For 4 clusters, the group with the highest average credit limit is still the most active online, and there is still a middle group that favors bank visits and telephone banking calls. However, there are now 2 clusters with the lowest average credit limits: Cluster 2 resembles the lower-tier group from the 3-cluster analysis, while Group 1 is quite different, as these individuals prefer bank visits over all other channels.
#Boxplots to show clustering with k = 4
ccdatascaled.boxplot(by='GROUP', layout = (2,4),figsize=(15,10))
Looking at these boxplots, there seems to be more overlap between the clusters here than when k = 3, so we could see a decrease in the silhouette score. As mentioned before, the within-cluster outliers reflect each cluster's own distribution, which in turn depends on where the centroids end up relative to the observations.
#Calculating silhouette score for k = 4
from sklearn.metrics import silhouette_score
k_means_4_score = silhouette_score(ccdatascaled, final_model.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % k_means_4_score)
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_K_Means_4 ={'Metric':['Silhouette Score'], 'K Means(4 Clusters)':[k_means_4_score]}
dataframe2 = pd.DataFrame(silhouette_score_K_Means_4)
dataframe2
It seems that k = 3 is the better number of clusters for this data when it comes to K-means. The silhouette score decreased slightly (by approx. 0.003), which suggests that splitting into a 4th cluster doesn't contribute much to explaining the variation in the data.
HIERARCHICAL CLUSTERING:
#In order to not confuse ourselves, let us create a new dataframe that eliminates the GROUP variable formed from K means before we attempt hierarchical clustering
print("Unscaled data as it is now")
print(ccdata1.head()) #Unscaled data as it is now
print("")
print("Scaled data as it is now")
print(ccdatascaled.head()) #Scaled data as it is now
print("")
#Create new dataframe with removed GROUP feature/variable
ccdata2=ccdata1.iloc[:,:5]
ccdatascaled2=ccdatascaled.iloc[:,:5]
#Viewing the unscaled and scaled dataframes for hierarchical clustering
print("New unscaled dataframe with dropped GROUP variable/feature")
print(ccdata2.head())
print("")
print("New scaled dataframe with dropped GROUP variable/feature")
ccdatascaled2.head()
#Let's use the agglomerative clustering technique
from sklearn.cluster import AgglomerativeClustering
HIERARCHICAL CLUSTERING USING AVERAGE LINKAGE METHOD FOR 3 CLUSTERS
#We can use the average linkage method and then fit it to the scaled data.
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data, like what we did previously when we discussed K means.
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#Checking to see if the labels are appended to the scaled data. We need this to make the boxplots.
ccdatascaled2.head()
#Boxplots to show hierarchical clustering by label (3 clusters) using average link
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
There is still a lot of overlap between the different clusters, which suggests that they may not be well separated. As before, some new within-cluster outliers appear, but these are a result of how hierarchical clustering partitions the data.
#We can evaluate the mean of each feature by label/cluster.
ccdata2.groupby(['labels']).mean()
#Import libraries to calculate the cophenetic coefficient and to create the dendrogram with different linkages
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist # This computes pairwise distances between data points
# Calculating the cophenetic coefficient
# The cophenetic coefficient measures the correlation between the distances of points in feature space and their distances on the dendrogram
# The closer it is to 1, the better the clustering
Z = linkage(ccdatascaled2, metric='euclidean', method='average')
c, coph_dists = cophenet(Z , pdist(ccdatascaled2))
c
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophenetic coefficient comparison between models
Cophenetic_coeff_avg ={'Metric':['Cophenetic Coefficient'], 'Average Linkage CC':[c]}
dataframe11 = pd.DataFrame(Cophenetic_coeff_avg)
dataframe11
This cophenetic coefficient shows a high correlation between the Euclidean distances between points in the multi-dimensional space and the dendrogram distances.
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
The big difference between hierarchical clustering (dendrograms) and K-means is that hierarchical clustering builds the entire hierarchical tree, whereas with K-means we simply specify the number of clusters we want. We need to read the dendrogram and decide how many clusters to keep. In order to calculate the silhouette score, we cut the dendrogram at the height corresponding to the last p merged clusters.
#For the sake of comparison to the K means models, we should specify 3 and 4 clusters
#Let's start with specifying p = 3
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=3, # show only the last p merged clusters
)
plt.show()
max_d = 4.2 # Cut height, read from the truncated dendrogram, at which 3 clusters are formed
#We need to check if the max_d is an appropriate estimation by looking at the clusters distribution in an array form.
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Checking this array, we can see values of 1, 2, and 3. Because there are 3 distinct groups, our choice of max_d is appropriate.
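A slightly more direct check (a small addition, not in the original notebook) counts how many observations fall into each cluster produced by the cut:
#Count the observations assigned to each cluster label from the cut
unique_labels, counts = np.unique(clusters, return_counts=True)
print(dict(zip(unique_labels, counts)))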
#Let's calculate the silhouette coefficient for 3 clusters using average linkage.
hc_3_clusters_silh_avg = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_avg
This silhouette coefficient falls a bit short of decent; around 0.5 is normally considered decent.
#Make a dataframe with this information so that we can merge it into a big table and compare this measure to the others.
silhouette_score_hc_avg_3 ={'Metric':['Silhouette Score'], 'Average(3 Clusters)':[hc_3_clusters_silh_avg]}
dataframe3 = pd.DataFrame(silhouette_score_hc_avg_3)
dataframe3
HIERARCHICAL CLUSTERING USING AVERAGE LINKAGE METHOD FOR 4 CLUSTERS
#Let's now evaluate Average Link technique with 4 clusters using boxplots
model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='average')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#Boxplots to show hierarchical clustering by label (4 clusters)
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
There is still a lot of overlap between the clusters.
#Let's now look at p = 4, and calculate the silhouette coefficient
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
)
plt.show()
max_d = 3.3 # After reading the zoomed-in dendrogram, the distance at which 4 clusters form seems to be approximately 3.3
#Let's check to see we have 4 different cluster types
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
hc_4_clusters_silh_avg = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_avg
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_hc_avg_4 ={'Metric':['Silhouette Score'], 'Average(4 Clusters)':[hc_4_clusters_silh_avg]}
dataframe4 = pd.DataFrame(silhouette_score_hc_avg_4)
dataframe4
The silhouette score went up for 4 clusters compared to 3, the opposite of what we saw with K-means. This is a reminder that average-linkage hierarchical clustering and K-means are two quite different clustering methods.
HIERARCHICAL CLUSTERING USING COMPLETE LINKAGE METHOD WITH 3 CLUSTERS
#We can use the complete linkage method and then fit it to the scaled data.
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data (3 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#This is made to answer the last question of the assignment
ccdata5 = ccdata2.copy() #Take a copy so that later relabeling of ccdata2 does not overwrite these complete-linkage labels
print(ccdata5.head())
print("")
print("Frequency of all 3 labels")
print(ccdata5['labels'].value_counts())
ccdataclust5 = ccdata5.groupby(['labels'])
ccdataclust5.mean()
After clustering with the complete linkage method, we can see how many individuals are in each group. The lower (Group 0) and mid-tier (Group 2) levels of average credit limit make up the bulk of this dataset. The group with the highest average credit limit has only about 50 customers, while the other two contain the large majority of customers.
#Boxplots to show hierarchical clustering by label (3 clusters) using complete link
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
These boxplots show an improvement in separation between the clusters compared with the previous plots. The plot for Total_visits_online shows only a slight overlap between Groups 0 and 2, while Group 1 overlaps only with an outlier from Group 0. This still does not mean the clusters are completely separated from one another. (The panel for the grouping variable itself, if it appears, can be ignored.)
# The cophenetic coefficient measures the correlation between the distances of points in feature space and their distances on the dendrogram
# The closer it is to 1, the better the clustering
Z_complete = linkage(ccdatascaled2, metric='euclidean', method='complete')
c_complete, coph_dists = cophenet(Z_complete , pdist(ccdatascaled2))
c_complete
This is a very high cophenetic correlation coefficient. It suggests that this method clusters well and that the dendrogram stays faithful to the pairwise distances in the data.
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophenetic coefficient comparison between models
Cophenetic_coeff_complete ={'Metric':['Cophenetic Coefficient'], 'Complete Linkage CC':[c_complete]}
dataframe12 = pd.DataFrame(Cophenetic_coeff_complete)
dataframe12
The cophenetic coefficient is also very high for the complete linkage method using Euclidean distance. We can now construct the dendrogram.
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z_complete, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
From this, you can see that the branching structure looks much the same for both agglomerative linkage methods so far. The main difference is the dendrogram distances at which the clusters are formed.
#Let's narrow in on the last 3 merged clusters
dendrogram(
Z_complete,
truncate_mode='lastp', # show only the last p merged clusters
p=3, # show only the last p merged clusters
)
plt.show()
max_d = 6.3 #From reading the dendrogram, we can see the distance at which 3 clusters form is about 6.3
#Let's check to see if there are actually 3 cluster types.
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_complete, max_d, criterion='distance')
clusters
#Let's calculate the silhouette coefficient using the complete link hierarchical clustering method for 3 clusters
hc_3_clusters_silh_complete = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_complete
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_hc_complete_3 ={'Metric':['Silhouette Score'], 'Complete(3 Clusters)':[hc_3_clusters_silh_complete]}
dataframe5 = pd.DataFrame(silhouette_score_hc_complete_3)
dataframe5
#Let's look at the boxplots for 4 clusters using the complete linkage method.
model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='complete')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#Boxplots to show hierarchical clustering by label (4 clusters) using complete link
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
COMPLETE LINKAGE WITH 4 CLUSTERS
#Calculate the silhouette coefficient
#Let's narrow in on the last 4 merged clusters
dendrogram(
Z_complete,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
)
plt.show()
max_d = 4.7 # Distance at which we see 4 clusters surface.
#Let's check to see that there are 4 clusters present
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_complete, max_d, criterion='distance')
clusters
#Silhouette coefficient for 4 clusters using Complete Linkage
hc_4_clusters_silh_complete = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_complete
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_hc_complete_4 ={'Metric':['Silhouette Score'], 'Complete(4 Clusters)':[hc_4_clusters_silh_complete]}
dataframe6 = pd.DataFrame(silhouette_score_hc_complete_4)
dataframe6
So far, 4 clusters yields a lower silhouette score than 3 clusters with the complete linkage method; the opposite is true for average linkage.
HIERARCHICAL CLUSTERING USING WARD LINKAGE METHOD
# The cophenetic coefficient measures the correlation between the distances of points in feature space and their distances on the dendrogram
# The closer it is to 1, the better the clustering
Z_ward = linkage(ccdatascaled2, metric='euclidean', method='ward')
c_ward, coph_dists = cophenet(Z_ward , pdist(ccdatascaled2))
c_ward
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophenetic coefficient comparison between models
Cophenetic_coeff_ward ={'Metric':['Cophenetic Coefficient'], 'Ward Linkage CC':[c_ward]}
dataframe13 = pd.DataFrame(Cophenetic_coeff_ward)
dataframe13
The cophenetic correlation coefficient is quite strong for Ward linkage, but the complete linkage dendrogram performs better on this measure. Let's now draw the Ward dendrogram.
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z_ward, leaf_rotation=90.,color_threshold=600, leaf_font_size=10. )
plt.tight_layout()
WARD METHOD 3 CLUSTERS
#Let's look at the boxplots for 3 clusters using the ward linkage method.
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data (3 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#Boxplots to show hierarchical clustering by label (3 clusters) using ward link
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
All clusters represented by boxplots here are overlapping.
#Let's narrow in on the last 3 merged clusters
dendrogram(
Z_ward,
truncate_mode='lastp', # show only the last p merged clusters
p=3, # show only the last p merged clusters
)
plt.show()
max_d = 45 #This is an approximate dendrogram distance for 3 clusters.
#Let's check for 3 clusters
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_ward, max_d, criterion='distance')
clusters
#Silhouette score for 3 cluster hierarchical clustering using ward linkage
hc_3_clusters_silh_ward = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_ward
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_hc_ward_3 ={'Metric':['Silhouette Score'], 'Ward(3 Clusters)':[hc_3_clusters_silh_ward]}
dataframe7 = pd.DataFrame(silhouette_score_hc_ward_3)
dataframe7
This is considered a decent silhouette score.
HIERARCHICAL CLUSTERING USING WARD LINKAGE METHOD FOR 4 CLUSTERS
#Let's look at the boxplots for 4 clusters using the ward linkage method.
model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
model.fit(ccdatascaled2)
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
#Boxplots to show hierarchical clustering by label (4 clusters) using ward link
ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
#Let's narrow in on the last 4 merged clusters
dendrogram(
Z_ward,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
)
plt.show()
max_d = 18 #This is an approximate dendrogram distance for 4 clusters using the Ward method.
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_ward, max_d, criterion='distance')
clusters
#Silhouette score for 4 clusters (ward linkage)
hc_4_clusters_silh_ward = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_ward
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.
silhouette_score_hc_ward_4 ={'Metric':['Silhouette Score'], 'Ward(4 Clusters)':[hc_4_clusters_silh_ward]}
dataframe8 = pd.DataFrame(silhouette_score_hc_ward_4)
dataframe8
COMPARING COPHENETIC COEFFICIENT OF EACH LINKAGE METHOD
#Merging the cophenetic correlation coefficient dataframes for comparison.
Metrics_Dataframe2 = pd.merge(dataframe11,dataframe12,how='outer',on='Metric')
Metrics_Dataframe2 = pd.merge(Metrics_Dataframe2, dataframe13,how='outer',on='Metric')
Metrics_Dataframe2
According to the results, complete linkage provides the highest cophenetic correlation (CPCC = 0.91). This means that, of all the hierarchical clustering models we tested, the dendrogram created with complete linkage is the most faithful to the original Euclidean distances between the scaled observations. That said, the complete-linkage dendrogram is perhaps the most reliable.
COMPARING SILHOUETTE SCORES FOR K MEANS
# Merge the K Means dataframes together for silhouette score comparison
Metrics_Dataframe1 = pd.merge(dataframe1,dataframe2,how='outer',on='Metric')
Metrics_Dataframe1
The closer the silhouette score is to 1, the better. In this case, k = 3 has the higher silhouette score, so we prefer k = 3: we want a sweet spot between being too coarse (everything compressed into one cluster) and too granular (every point its own cluster). The cost of increasing the number of clusters beyond 3 is not worth the benefit, since the addition of a 4th cluster does not explain much more of the variation in the data.
COMPARING SILHOUETTE SCORES FOR HIERARCHICAL CLUSTERING
# Merge all hierarchical clustering dataframes for silhouette score comparison
Metrics_Dataframe3 = pd.merge(dataframe3,dataframe4,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe5,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe6,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe7,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe8,how='outer',on='Metric')
Metrics_Dataframe3
These silhouette scores were calculated for the dendrogram-based models. In this approach we do not specify the number of clusters beforehand; instead we read the dendrogram, choose a cut height, and derive the cluster labels from that cut.
From the above results, 3 clusters had a better silhouette coefficient than 4 clusters only for complete linkage, while 4 clusters scored better than 3 for both the average and Ward linkage techniques.
Average linkage is generally influenced by outliers (extreme high and low values) because it relies on means. Complete and Ward linkages are more robust methods that can handle the presence of outliers, so it makes sense to put more confidence in these two techniques. Complete linkage with 3 clusters has a higher silhouette score (+0.04), closer to 1, than complete linkage with 4 clusters. (Complete linkage also had the highest cophenetic correlation coefficient, which implies its dendrogram is the most faithful to the Euclidean pairwise distances and is likely the better dendrogram; see above.) For Ward linkage, it can be argued that the addition of a 4th cluster does not add significant explanation of the variation in the data, so the cost is not worth the benefit.
If we compare the K-means methodology to hierarchical clustering, K-means would be preferable because it is computationally cheaper: building a K-means model is faster than building a hierarchical model in Python. The main reason is the number of Euclidean distances that must be computed between observations. K-means computes k·n distances per iteration (3 × 660 = 1,980 here), while agglomerative hierarchical clustering needs on the order of n(n-1)/2 pairwise distances (660 × 659 / 2 = 217,470 here), so hierarchical clustering has far more distances to calculate. However, hierarchical clustering tends to produce more intuitive, interpretable results and relies on fewer assumptions, even if it takes more time to run.
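As an illustration of that cost difference, here is a rough timing sketch (not part of the original analysis; the random_state value and the choice of complete linkage are assumptions, and exact timings will vary by machine):
#Rough timing comparison of the two approaches on the five scaled features
import time
from sklearn.cluster import KMeans, AgglomerativeClustering
X = ccdatascaled2.iloc[:, :5] #Exclude the appended label column
t0 = time.perf_counter()
KMeans(n_clusters=3, random_state=42).fit(X) #random_state assumed for reproducibility
t1 = time.perf_counter()
AgglomerativeClustering(n_clusters=3, linkage='complete').fit(X)
t2 = time.perf_counter()
print('K-means fit time:       %.4f s' % (t1 - t0))
print('Agglomerative fit time: %.4f s' % (t2 - t1))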
In any case, all silhouette coefficients are above 0, indicating that, on average, a decent share of observations are closer to points within their own cluster than to the neighboring clusters.
Key Questions:
How many different segments of customers are there? How are these segments different from each other? What are your recommendations to the bank on how to better market to and service these customers?
The clusters are not entirely separated, so we cannot truly conclude that they are independent of one another. However, we can still draw conclusions about the clusters formed by K-means and hierarchical clustering. The two are very different procedures, so they may yield quite different groupings.
To answer this question, we must look back at the unscaled dataset with the appended GROUPS/labels for 3 clusters
#Means of every cluster for each feature using K Means clustering on unscaled data
ccdataclust.mean()
#Medians of every cluster for each feature using K Means clustering on unscaled data
ccdataclust.median()
Here are the tables of means and medians for each feature, grouped by cluster, for K-means.
Group 0 is the middle group. They have a moderate average credit limit and a mean and median of 6 credit cards. They also seem to prefer human interaction to self-service banking: of all the clusters, this one makes the most calls to the bank and has the most bank visits.
Group 1 may represent the biggest financial liability of all the groups. They have the lowest average credit limit and own the fewest credit cards on average, and they tend to call the bank more often than they use other banking channels.
Group 2 has very high average credit limits and the most credit cards. They give the impression of being trustworthy customers who pay their bills on time, which leads to higher average credit limits and frequent credit card offers, many of which they accept. They rarely visit or call the bank; instead, they access the bank's website.
I think it would be better if we had more information about the clientele, such as job type, age, and mortgage. These factors could perhaps improve segmentation.
Let's now look at the segmentation from hierarchical clustering, using complete linkage (the best-performing dendrogram from the analysis)
#Means of every cluster for each feature using Hierarchical clustering (Complete Link) on unscaled data
ccdataclust5.mean()
#Medians of every cluster for each feature using Hierarchical clustering (Complete Link) on unscaled data
ccdataclust5.median()
The groups are labeled differently in hierarchical clustering, but the patterns seen with K-means are consistent: Group 0 from K-means is also Group 0 here, Group 1 from K-means is Group 2 here, and Group 2 from K-means is Group 1 here. As with K-means, the group with the highest average credit limit spends more time banking online, the middle group spends more time calling and visiting the bank, and calling the bank is the preferred channel for the higher-liability group.
Recommendations for the bank:
The target group: there appear to be 3 major segments among the customers in this banking dataset. Of these 3 clusters, the bank should focus its outreach budget on the middle tier. Customers in the upper bracket already hold 9 credit cards on average and may be less inclined to take up another credit card promotion despite their high average credit limit, so upselling to them may be difficult; this group is also very small, with only about 50 individuals. Customers in the lowest tier of average credit limit may be a financial liability because of their low limits, even though they might be more inclined to participate since they hold fewer cards.
Advertising suggestions: the bank should distribute its advertising budget across all 3 channels, but should put more emphasis on in-branch posters, on-hold phone messages, and banking representatives promoting the offer by phone and in person. Online ads would, for the most part, attract individuals at the extremes (the upper and lower tiers); the upper tier is only a small portion of customers and the lower tier could be a financial risk. Nonetheless, online advertising should not be halted entirely, as it could attract new customers with promising prospects.